Name: Suman Kumar

Purpose: modeling with sklearn

Status: Incomplete

Part: /

This is an attempt to use certain tools of data analytics and also use of certain statistical elements .

The Data used here is obtained form nhanes site(public data). This is also a part of online course on coursera.

Multivariate relationships and Visualization Techniques

In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [8]:
da=pd.read_csv('B:\\Python Learnings\\Codes\\NHANES_dataset\\nhanes_2015_2016.csv')
da.head(2)
Out[8]:
SEQN ALQ101 ALQ110 ALQ130 SMQ020 RIAGENDR RIDAGEYR RIDRETH1 DMDCITZN DMDEDUC2 ... BPXSY2 BPXDI2 BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST HIQ210
0 83732 1.0 NaN 1.0 1 1 62 3 1.0 5.0 ... 124.0 64.0 94.8 184.5 27.8 43.3 43.6 35.9 101.1 2.0
1 83733 1.0 NaN 6.0 1 1 53 3 2.0 3.0 ... 140.0 88.0 90.4 171.4 30.8 38.0 40.0 33.2 107.9 NaN

2 rows × 28 columns

In [16]:
# help(sns.regplot)
sns.regplot(x='BMXARML',y='BMXLEG',data=da,dropna=True,scatter_kws={"alpha": 0.4},marker='+',color='red')
# p.map(plt.scatter,x,y,color='green')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0xf2f5c9a488>
In [22]:
# help(sns.jointplot)
sns.jointplot(x='BMXARML',y='BMXLEG',data=da,kind='kde', dropna=True)
Out[22]:
<seaborn.axisgrid.JointGrid at 0xf2f67eec48>
In [ ]:
#conclusion of above two statements
In [23]:
#relation between the blood pressure in humans
In [35]:
sns.set_palette("Paired")
sns.jointplot(x='BPXSY1',y='BPXDI1',data=da,dropna= True,kind='kde')
Out[35]:
<seaborn.axisgrid.JointGrid at 0xf2f9c1dc88>
In [36]:
sns.jointplot(x='BPXSY1',y='BPXSY2',data=da,dropna= True,kind='kde')
Out[36]:
<seaborn.axisgrid.JointGrid at 0xf2f9ce8088>
In [37]:
#high correlation between the data can be observed.(bp of same person few mins apart: seems obvious)
In [47]:
da['RIAGENDRX']=da.RIAGENDR.replace({1:'Male',2:'Female'})
# # help(sns.FacetGrid)
sns.FacetGrid(da,row="RIAGENDRx").map(plt.scatter, "BMXLEG","BMXARML",alpha=0.4,marker='+',color='green').add_legend()
Out[47]:
<seaborn.axisgrid.FacetGrid at 0xf2fcbdc748>
In [49]:
print(da.loc[da.RIAGENDRx=="Male", ["BMXLEG", "BMXARML"]].dropna().corr())
print(da.loc[da.RIAGENDRx=="Female", ["BMXLEG", "BMXARML"]].dropna().corr())
           BMXLEG   BMXARML
BMXLEG   1.000000  0.505426
BMXARML  0.505426  1.000000
           BMXLEG   BMXARML
BMXLEG   1.000000  0.434703
BMXARML  0.434703  1.000000
The above data will give correlation between limbs of male and female
In [52]:
# check the use of pair plot
sns.FacetGrid(da,col='RIDRETH1',row='RIAGENDRx').map(plt.scatter,"BMXLEG","BMXARML",alpha=0.6,marker='+',color='red').add_legend()
Out[52]:
<seaborn.axisgrid.FacetGrid at 0xf2fedcc548>
In [54]:
#Categorical bivariate data
In [60]:
da["DMDEDUC2x"]=da.DMDEDUC2.replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 
                                       7: "Refused", 9: "Don't know"})
da["DMDMARTLx"] = da.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married",
                                      6: "Living w/partner", 77: "Refused"})
db=da.loc[(da.DMDEDUC2x!="Don't Know")&(da.DMDMARTLx!="refused"),:]
da.head(3)
Out[60]:
SEQN ALQ101 ALQ110 ALQ130 SMQ020 RIAGENDR RIDAGEYR RIDRETH1 DMDCITZN DMDEDUC2 ... BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST HIQ210 RIAGENDRX RIAGENDRx DMDEDUC2x DMDMARTLx
0 83732 1.0 NaN 1.0 1 1 62 3 1.0 5.0 ... 27.8 43.3 43.6 35.9 101.1 2.0 Male Male College Married
1 83733 1.0 NaN 6.0 1 1 53 3 2.0 3.0 ... 30.8 38.0 40.0 33.2 107.9 NaN Male Male HS/GED Divorced
2 83734 1.0 NaN NaN 1 1 78 3 1.0 3.0 ... 28.8 35.6 37.0 31.0 116.5 2.0 Male Male HS/GED Married

3 rows × 32 columns

In [59]:
db.head(2)
Out[59]:
SEQN ALQ101 ALQ110 ALQ130 SMQ020 RIAGENDR RIDAGEYR RIDRETH1 DMDCITZN DMDEDUC2 ... BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST HIQ210 RIAGENDRX RIAGENDRx DMDEDUC2x DMDMARTLx
0 83732 1.0 NaN 1.0 1 1 62 3 1.0 5.0 ... 27.8 43.3 43.6 35.9 101.1 2.0 Male Male College Married
1 83733 1.0 NaN 6.0 1 1 53 3 2.0 3.0 ... 30.8 38.0 40.0 33.2 107.9 NaN Male Male HS/GED Divorced

2 rows × 32 columns

Plan to create contingency table with respect to education status and marriage status

In [65]:
# help(pd.crosstab)
# x=pd.crosstab(db.DMDEDUC2x,da.DMDMARTLx,da.RIAGENDRx)
x=pd.crosstab(db.DMDEDUC2x,da.DMDMARTLx)
x
Out[65]:
DMDMARTLx Divorced Living w/partner Married Never married Refused Separated Widowed
DMDEDUC2x
9-11 62 80 305 117 0 39 40
<9 52 66 341 65 0 43 88
College 120 85 827 253 0 22 59
Don't know 1 0 0 0 0 0 2
HS/GED 127 133 550 237 0 40 99
Some college/AA 217 163 757 332 2 42 108
In [68]:
x.apply(lambda z: z/z.sum(), axis=1)
Out[68]:
DMDMARTLx Divorced Living w/partner Married Never married Refused Separated Widowed
DMDEDUC2x
9-11 0.096423 0.124417 0.474339 0.181960 0.000000 0.060653 0.062208
<9 0.079389 0.100763 0.520611 0.099237 0.000000 0.065649 0.134351
College 0.087848 0.062225 0.605417 0.185212 0.000000 0.016105 0.043192
Don't know 0.333333 0.000000 0.000000 0.000000 0.000000 0.000000 0.666667
HS/GED 0.107083 0.112142 0.463744 0.199831 0.000000 0.033727 0.083474
Some college/AA 0.133868 0.100555 0.466996 0.204812 0.001234 0.025910 0.066626
In [69]:
x.apply(lambda z: z/z.sum(), axis=0)
Out[69]:
DMDMARTLx Divorced Living w/partner Married Never married Refused Separated Widowed
DMDEDUC2x
9-11 0.107081 0.151803 0.109712 0.116534 0.0 0.209677 0.101010
<9 0.089810 0.125237 0.122662 0.064741 0.0 0.231183 0.222222
College 0.207254 0.161290 0.297482 0.251992 0.0 0.118280 0.148990
Don't know 0.001727 0.000000 0.000000 0.000000 0.0 0.000000 0.005051
HS/GED 0.219344 0.252372 0.197842 0.236056 0.0 0.215054 0.250000
Some college/AA 0.374784 0.309298 0.272302 0.330677 1.0 0.225806 0.272727
In [72]:
# db.groupby(["RIAGENDRx","DMDEDUC2x","DMDMARTLx","DMDCITZN"]).size().unstack().fillna(0).apply(lambda x: x/x.sum(),axis=1)
db.groupby(["RIAGENDRx","DMDEDUC2x","DMDMARTLx",]).size().unstack().fillna(0).apply(lambda x: x/x.sum(),axis=1)
Out[72]:
DMDMARTLx Divorced Living w/partner Married Never married Refused Separated Widowed
RIAGENDRx DMDEDUC2x
Female 9-11 0.113402 0.123711 0.412371 0.171821 0.000000 0.075601 0.103093
<9 0.091691 0.091691 0.424069 0.108883 0.000000 0.088825 0.194842
College 0.110181 0.055788 0.577406 0.182706 0.000000 0.016736 0.057183
Don't know 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
HS/GED 0.121784 0.109777 0.413379 0.188679 0.000000 0.041166 0.125214
Some college/AA 0.148515 0.099010 0.418042 0.210121 0.001100 0.031903 0.091309
Male 9-11 0.082386 0.125000 0.525568 0.190341 0.000000 0.048295 0.028409
<9 0.065359 0.111111 0.630719 0.088235 0.000000 0.039216 0.065359
College 0.063174 0.069337 0.636364 0.187982 0.000000 0.015408 0.027735
Don't know 0.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000
HS/GED 0.092869 0.114428 0.512438 0.210614 0.000000 0.026534 0.043118
Some college/AA 0.115169 0.102528 0.529494 0.198034 0.001404 0.018258 0.035112
In [73]:
dx = db.loc[(db.RIDAGEYR >= 40) & (db.RIDAGEYR < 50)]
a = dx.groupby(["RIAGENDRx", "DMDEDUC2x", "DMDMARTLx"]).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)

dx = db.loc[(db.RIDAGEYR >= 50) & (db.RIDAGEYR < 60)]
b = dx.groupby(["RIAGENDRx", "DMDEDUC2x", "DMDMARTLx"]).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)

print(a.loc[:, ["Married"]].unstack())
print("")
print(b.loc[:, ["Married"]].unstack())
DMDMARTLx   Married                                              
DMDEDUC2x      9-11        <9   College    HS/GED Some college/AA
RIAGENDRx                                                        
Female     0.581818  0.464286  0.713376  0.476744        0.509554
Male       0.574074  0.714286  0.879310  0.616279        0.625000

DMDMARTLx   Married                                              
DMDEDUC2x      9-11        <9   College    HS/GED Some college/AA
RIAGENDRx                                                        
Female     0.490566  0.511111  0.648649  0.563107        0.496403
Male       0.666667  0.622642  0.737374  0.637255        0.555556

Interpret the result

In [75]:
plt.figure(figsize=(12, 4))
a = sns.boxplot(db.DMDMARTLx, db.RIDAGEYR)
In [76]:
plt.figure(figsize=(12, 4))
a = sns.violinplot(da.DMDMARTLx, da.RIDAGEYR)
In [77]:
sns.pairplot(da)
C:\Users\TOSHABA\Anaconda3\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\Users\TOSHABA\Anaconda3\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
Out[77]:
<seaborn.axisgrid.PairGrid at 0xf2fc48f748>
In [78]:
sns.pairplot(db)
Out[78]:
<seaborn.axisgrid.PairGrid at 0xf2996c3f08>
In [79]:
db.head()
Out[79]:
SEQN ALQ101 ALQ110 ALQ130 SMQ020 RIAGENDR RIDAGEYR RIDRETH1 DMDCITZN DMDEDUC2 ... BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST HIQ210 RIAGENDRX RIAGENDRx DMDEDUC2x DMDMARTLx
0 83732 1.0 NaN 1.0 1 1 62 3 1.0 5.0 ... 27.8 43.3 43.6 35.9 101.1 2.0 Male Male College Married
1 83733 1.0 NaN 6.0 1 1 53 3 2.0 3.0 ... 30.8 38.0 40.0 33.2 107.9 NaN Male Male HS/GED Divorced
2 83734 1.0 NaN NaN 1 1 78 3 1.0 3.0 ... 28.8 35.6 37.0 31.0 116.5 2.0 Male Male HS/GED Married
3 83735 2.0 1.0 1.0 2 2 56 3 1.0 5.0 ... 42.4 38.5 37.7 38.3 110.1 2.0 Female Female College Living w/partner
4 83736 2.0 1.0 1.0 2 2 42 4 1.0 4.0 ... 20.3 37.4 36.0 27.2 80.4 2.0 Female Female Some college/AA Divorced

5 rows × 32 columns

In [ ]:
 
In [ ]: